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Abstract 

A very important topic in systems biology is developing statistical 
methods that automatically find causal relations in gene regulatory net- 
works with no prior knowledge of causal connectivity. Many methods 
have been developed for time series data. However, discovery methods 
based on steady-state data are often necessary and preferable since ob- 
taining time series data can be more expensive and/or infeasible for many 
biological systems. A conventional approach is causal Bayesian networks. 
However, estimation of Bayesian networks is ill-posed. In many cases it 
cannot uniquely identify the underlying causal network and only gives 
a large class of equivalent causal networks that cannot be distinguished 
between based on the data distribution. We propose a new discovery algo- 
rithm for uniquely identifying the underlying causal network of genes. To 
the best of our knowledge, the proposed method is the first algorithm for 
learning gene networks based on a fully identifiable causal model called 
LiNGAM. We here compare our algorithm with competing algorithms us- 
ing artificially-generated data, although it is definitely better to test it 
based on real microarray gene expression data. 



1 Introduction 

An important topic in bioinformatics is developing computational methods to 
discover gene regulatory causal networks from static expression data [TH3]- 
Based on the estimated networks, one can compute intervention effects, i.e., 
causal effects, which enable predicting to what extent the expression level of a 
gene changes when that of another gene is externally changed [3] . One can rank 
causal relations between genes according to the existense and/or strengths of 
causal effects. Such a ranking can be used as a priority list to efficiently conduct 
future interventional experiments and obtain solid evidence [5]- 

Many estimation techniques have been proposed for time series data [3HS]- 
Those techniques use temporal information to estimate the underlying gene 
network structure. However, it is not very feasible to obtain time scries data for 
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many biological systems. In fact, many data that have been analyzed as time 
series data are not really longitudinal due to destructive sampling (TUJ. Then, 
discovery methods based on steady-state data could be better suited for such 
non-longitudinal 'time series' data. 

A conventional approach for estimating gene causal networks based on steady 
state data is Bayesian networks [TT1Q2]. However, Bayesian networks suffers 
from the identifiability problem. In the framework of Bayesian networks, many 
networks with different structures give the same conditional independences be- 
tween variables or genes, and in many cases one cannot uniquely estimate the 
underlying causal network without any prior knowledge on the structure [HEI] • 
Thus, causal effects often are not uniquely estimated as well. 

In this paper, we propose an attractive alternative approach that enables 
uniquely estimating a causal gene network. We first model the causal network 
of genes using a non-Gaussian causal model called LiNGAM P3|. LiNGAM 
is a fully identifiable causal model [T3] unlike conventional Bayesian networks 
and has recently attracted much attention in machine learning |15U16j . Then, 
we present a new algorithm for estimating LiNGAM based on data with more 
variables than observations. Finally, we test our approach using artificially- 
generated data. 



We first define our model in Section 12.11 and then propose a new algorithm 
to estimate the model based on data with more variables than observations in 
Section H2 

2.1 Model 

Let us denote by Xi the expression level of gene i (i = 1, • • • ,p). We model 
causal relations of gene expression levels (variables) x,i (i = I, • • • ,p) using a 
linear non-Gaussian causal model called LiNGAM |14| : 



where k(i) is such a causal ordering of genes that they graphically form a di- 
rected acyclic graph (DAG). This means that no later gene directly or indirectly 
regulates any earlier gene, that is, has a directed path on any earlier gene. The 
ei are the external influences or noises and by are connection strengths of gene j 
on gene i. The zero/non-zero pattern of by corresponds to the absence/existence 
pattern of directed edges. External influences follow non-Gaussian continuous 
distributions with non-zero variances and are mutually independent. 

The assumption that external influences are non-Gaussian enables unique 
identification of a causal ordering k(i) and connection strengths by without 
using any background knowledge on the structure [14]. This feature is a big 
advantage [T4UT6] over conventional Bayesian network approaches based on 
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conditional independences and/or Gaussianity [T7JHB] - Though the Gaussian 
approximation has been a common approach |19) , real- world data could be con- 
sidered more or less non-Gaussian. In fact, non-Gaussian data appear in many 
applications including bioinformatics [20112 lj . 

We note that each bij (i ^ j) represents the direct causal effect of Xj on Xi 
and each ay (i ^ j), the (i, j)-th element of the matrix A=(I — B) _1 , the total 
causal effect of xj on Xi [41122] . Based on total causal effects , one can predict 
to what extent the expression level of gene i changes when that of gene j is 
externally changed. Further, based on direct causal effects bij, one can predict 
to what extent the expression level of gene i changes when that of gene j is 
externally changed while those of all other genes than gene i and gene j are 
held fixed. 

Rigorously speaking, the linearity assumption would be more or less violated 
in real- world gene regulatory causal networks. Nonlinear approaches |23H25j 
might be better to model causal relations of genes. However, in general, lin- 
ear methods can often give better results when it is more important to find 
quantitative relations since nonlinear methods usually require very large sample 
sizes [3]. 

2.2 Estimation of the model 

Now, the problem of causal discovery is to estimate a causal ordering k(i) and 
connection strengths bij based on data x only. Several estimation methods 
for LiNGAM [T?l[55] have been proposed that do not require to specify the 
distributions of the non-Gaussian external influences e*. However, they are not 
applicable for such cases with more variables than observations that arc typical 
in gene expression data analysis. Thus, we extend an LiNGAM estimation 
algorithm [55] to cases with more variables than observations. 

In [26j . a direct method was proposed to estimate a causal ordering k(i). This 
leads to more algorithmically reliable results since it is not necessary to resort 
to iterative search in the parameter space. First, it estimates an exogenous 
variable. An exogenous variable is a variable with no parents and can be at the 
top of a causal ordering. One can estimate such a variable that is exogenous 
by finding a variable that minimizes a non-parametric independence measure 
called KGV [37] between the variable and its regression residuals [2BJ. Once 
an exogenous variable is found, then one subtract the effect of the exogenous 
variable from the other variables using linear least squares regression. One 
can find all the causal orders by iterating this [25]. Once a causal ordering 
k(i) is estimated, one can prune or set to zero redundant connection strengths 
among bij by repeatedly applying a sparse regularization method called adaptive 
lasso [35J on each variable and its potential parents. The adaptive lasso [55] is 
a weighted version of a regularization technique for variable selection lasso [29j 
and assumes the same data generating process as LiNGAM: 
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The adaptive lasso assumes that the set of such potential parent variables Xj 
that k(j)<k(i) is known, while LiNGAM estimates the set of such variables. The 
adaptive Lasso penalizes connection strengths bij in L\ penalty by minimizing 
the objective function defined as: 



k(j)<k(i) 



+ A ^ w ij\ b ij\, 
k(j)<k(i) 



where A is a regularization parameter and uiij is a weight for b^ . In 28 , it was 
suggested to use the inverse of the absolute value of the ordinary least squares 
regression estimate or the ridge regression estimate of bij as Wij . The adaptive 
lasso asymptotically selects the right set of such variables Xj that b^ is not zero, 
where k{j)<k{i). 

To apply the direct method |26| on data with more variables than obser- 
vations, we make three modifications. First, in cases with more variables than 
observations, the sample covariance matrix of variables is singular. Thus, we use 
ridge regression instead of linear least squares regression to compute regression 
residuals. Second, non-parametric independence measures, e.g., KGV [37] and 
HSIC [3D], require large sample sizes and much computational time. They could 
give a similar performance as simple nonlinear correlation measures for small 
sample sizes [3T[. Therefore, we use a nonlinear correlation [3TIE3 to evalu- 
ate independence. We first evaluate pairwise independence between a variable 
and each of the residuals and then take the sum of the pairwise independence 
measures over the residuals. Third, before applying adaptive lasso, we reduce 
the dimension of data to at most n— 1 using a dimension reduction method 
called iterative sure independence screening (ISIS) [33J so that the dimension 
is smaller than the sample size. This would be reasonable in estimation of 
gene networks since they are commonly assumed to be sparse. ISIS selects ex- 
planatory variables that have the first n/log(n) largest correlation coefficients 
with the explained variable in absolute value and apply lasso on the selected 
variables. It repeats this procedure on the residual of the explained variable 
over the selected explanatory variables until a desired number of variables, here 
n— 1 variables, are selected. We then apply lasso on the selected variables by 
ISIS and finally adaptive lasso on the variables still surviving. The combination 
of ISIS, lasso and adaptive lasso was suggested by [33J to accurately identify 
non-zero coefficients when the dimension is larger than the sample size. We 
select the regularization parameter for each method using BIC based on Gaus- 
sianity |34[|35| . Though our model assumes non-Gaussianity, we optimistically 
assume that the effect of misspecification of the model distribution might not 
be so serious since the lasso and adaptive lasso estimation only involves means 
and covariances of observed variables. Moreover, we can estimate total causal 
effects dij by applying ISIS, lasso and adaptive lasso of gene i on the set of gene 
j and the parents of gene j according to the back-door criterion [36) . 

We now present our algorithm to estimate the LiNGAM in Equation ([1} 
from data with more variables (or genes) than observations: 
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Input: Data matrix X 



1. Given a p-dimensional random vector x, a set of its variable subscripts U and 
a p x n data matrix of the random vector as X, initialize an ordered list of 
variables K := and m := 1. 

2. Repeat until p—1 subscripts are appended to K: 

(a) Denote by Xx a vector that collects all the variables in K. Perform 
ridge regression of Xj on xk with the ridge parameter t and compute 
the residual Xj for all j G U\K. Compute the matrix X that collects 
the values of Xj (j G U\K) from the data matrix X. 

(b) Perform ridge regression of Xi on [xj, x^] T with the ridge parameter r 

for all i G U\K (i ^ j) and compute the residual vectors r^=[r| ] 
and the residual data matrix from the matrix X for all j G U\K. 

(c) Find a variable a; m that is most independent of the residuals r| . 



where g(-) is tanh(-). 
(d) Append m to the end of K . 

3. Append the remaining variable subscript to the end of K . 

4. Estimate connection strengths or direct causal effects bij by doing adaptive 
lasso of Xi on all the variables with earlier causal orders than X{ based on 
the data matrix X for all i G U. In case of more parent candidates than 
observations, the dimension is reduced to at most n—1 by ISIS and lasso before 
applying adaptive lasso. The weights w%j for adaptive lasso are estimated by 
the inverses of the absolute values of ridge regression coefficients with the 
ridge parameter r. 

5. Estimate total causal effects by doing adaptive lasso of X{ on the set of Xj 
and its parent variables based on the data matrix X for all i, j G U(i ^ j). In 
case of more parent candidates than observations, the dimension is reduced to 
at most n — 1 by ISIS and lasso before applying adaptive lasso. The weights 
Wij for adaptive lasso are estimated by the inverses of the absolute values of 
ridge regression coefficients with the ridge parameter r. 

Output: A causal network given by the zero/non-zero pattern of B, direct causal 
effects bij and total causal effects ay between observed variables xi and Xj = 

1) ' ■ ■ ,P,i^ j)- 



X m = arg mm 

jeu\K 





+ |corr{a;j, 5 (r J UJ )} 



(2) 
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3 Experiments on artificial data 



As a sanity check of our method, we performed an experiment with synthetic 
data. The experiment consisted of 1001 trials. In each trial, we generated 
a dataset with dimension p = 100 and sample size n — 30 and applied our 
estimation method on the data. For comparison, we also tested three methods: 
1) random guessing, 2) lasso and 3) elastic net [37] ■ Regarding lasso and 
elastic net, we applied lasso or elastic net on every variable taking the other 
variables as its explanatory variables and considered that explanatory variables 
with non-zero regression coefficients have non-zero direct causal effects and non- 
zero total causal effects on the explained variable as in [5] and [38]. Note that the 
lasso- and elastic net-based approaches do not give a DAG network. We used the 
coordinate descent algorithm in [3S] of Matlab statistics toolbox to perform lasso 
and elastic net. Regarding clastic net, the weights of the lasso penalty and ridge 
penalty were set to be equal. All the rcgularization parameters were selected 
using BIC based on Gaussianity. The ridge parameter r for ridge regression was 
set to 0.01. Regarding random guessing, we first generated a random ordering of 
observed variables and then randomly created as many non-zero direct causal 
effects and non-zero total causal effects as LiNGAM found while keeping the 
network being acyclic. 

Each dataset was created as follows: 

1. We constructed the p x p connection strength matrix with all zeros and 
replaced every element in the lower-triangular part by independent real- 
izations of Bernoulli random variables with success probability s similarly 
to [10] • The probability s determines the sparseness of the model. The 
expected number of adjacent variables of each observed variable is given 
by s(p— 1). We randomly set the sparseness s so that the expected number 
of adjacent variables was 2 or 5. 

2. We replaced each non-zero entry in the connection strength matrix by a 
value randomly chosen from the interval [—1.5, —0.5] U [0.5, 1.5] and se- 
lected variances of the external influences from the interval [1,3]. The 
resulting matrix was used as the data-generating connection strength ma- 
trix B. 

3. We generated data with sample size n by independently drawing the ex- 
ternal influence variables e^. We randomly selected the distribution of 
each ei from a multimodal asymmetric mixture of two Gaussians, a mul- 
timodal symmetric mixture of two Gaussians and a Laplace distributions 
used in [57] . Then, we generated constants following the Gaussian distri- 
bution A(0, 4) and added them to the external influence variables as their 
means. 

4. The values of the observed variables Xi were generated using the connec- 
tion strength matrix B and external influences e^. Finally, we permuted 
the variables according to a random ordering. 
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Figure 1: Box plot of accuracies of non-zero direct causal effects bij. 

For every dataset, we computed the percentage of actually non-zero direct 
causal effects (or existing directed edges) in estimated ones, that is, the accuracy 
(or the true discovery rate) and the percentage of correctly estimated non-zero 
direct causal effects in actually non-zero ones, that is, the coverage. Fig. Q] 
and Fig. [2] show box plots of the accuracies and coverages of non-zero direct 
causal effects in B. The median accuracies of non-zero direct causal effects 
for the four methods, random guessing, lasso, elastic net, and LiNGAM, were 
0.017, 0.063, 0.047, and 0.104, respectively and the median coverages were 0.086, 
0.829, 0.912, and 0.489, respectively. Fig. [3] and Fig. H show box plots of the 
accuracies and coverages of non-zero total causal effects in A(= (I — B) _1 ). 
The median accuracies of non-zero total causal effects for random guessing, 
lasso, elastic net, and LiNGAM were 0.404, 0.462, 0.490, and 0.54, respectively 
and the median coverages were 0.085, 0.303, 0.473, and 0.142, respectively. The 
median computational times for the lasso, elastic net and LiNGAM were 60.13, 
48.36, and 728 seconds on a standard PC. 

In summary, our method gave better accuracies than the other methods. 
This implied that our method is more suitable for prioritizing future experi- 
ments. 

4 Conclusions 

We presented a new algorithm for estimating LiNGAM based on data with more 
variables than observations. Future works would include i) empirical comparison 
of our method and related algorithms on microarray gene expression datasets; 
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Figure 2: Box plot of coverages of non-zero direct causal effects hi 



Accuracy of A 



1 - 

0.9 - 

0.8 - 

0.7 - 

0.6 - 

0.5 - 

0.4 - 

0.3 - 

0.2 - 

0.1 - 

- 



Rnd 



Elastic 



LiNGAM 



Figure 3: Box plot of accuracies of non-zero total causal effects di 



ii) extensions of our method to cases with latent confounding variables [22] and 
cyclic relations [41 j - 



Coverage of A 




Figure 4: Box plot of coverages of non-zero total causal effects ay. 
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